Abstract

Many real-world phenomena can be represented by a spatio-temporal signal: 
where, when, and how much. Social media is a tantalizing data source for those 
who wish to monitor such signals. Unlike most prior work, we assume that the target
phenomenon is known and we are given a method to count its occurrences in
social media. However, counting is plagued by sample bias, incomplete data, and,
paradoxically, data scarcity - issues inadequately addressed by prior work. We formulate
signal recovery as a Poisson point process estimation problem. We explicitly
incorporate human population bias, time delays and spatial distortions, and spatio-temporal
regularization into the model to address the noisy count issues. We present
an efficient optimization algorithm and discuss its theoretical properties. We show 
that our model is more accurate than commonly-used baselines. Finally, we present 
a case study on wildlife roadkill monitoring, where our model produces qualitatively 
convincing results. 

1 Introduction 

Many real-world phenomena of interest to science are spatio-temporal in nature. They can be characterized
by a real-valued intensity function $f \in \mathbb{R}_{\ge 0}$, where the value $f_{s,t}$ quantifies the prevalence
of the phenomenon at location s and time t. Examples include wildlife mortality, algal blooms, 
hail damage, and seismic intensity. Direct instrumental sensing of f is often difficult and expensive. 
Social media offers a unique sensing opportunity for such spatio-temporal signals, where users serve 
the role of "sensors" by posting their experiences of a target phenomenon. For instance, social media
users readily post their encounters with dead animals: "I saw a dead crow on its back in the
middle of the road."

There are at least three challenges faced when using human social media users as sensors: 

1. Social media sources are not always reliable and consistent, due to factors including the 
vagaries of language and the psychology of users. This makes identifying topics of interest 
and labeling social media posts extremely challenging. 

2. Social media users are not under our control. In most cases, users cannot be directed or 
focused or maneuvered as we wish. The distribution of human users (our sensors) depends on 
many factors unrelated to the sensing task at hand. 



3. Location and time stamps associated with social media posts may be erroneous or missing. 
Most posts do not include GPS coordinates, and self-reported locations can be inaccurate or 
false. Furthermore, there can be random delays between an event of interest and the time of 
the social media post related to the event. 

Most prior work in social media event analysis has focused on the first challenge. Sophisticated 
natural language processing techniques have been used to identify social media posts relevant to 
a topic of interest [32, 3, 25], and advanced machine learning tools have been proposed to discover
popular or emerging topics in social media [4, 18, 33]. We discuss the related work in detail in
Section 3.

Our work in this paper focuses on the latter two challenges. We are interested in a specific topic 
or target phenomenon of interest that is given and fixed beforehand, and we assume that we are also 
given a (perhaps imperfect) method, such as a trained text classifier, to identify target posts. The 
first challenge is relevant here, but is not the focus of our work. The main concerns of this paper 
are to deal with the highly non-uniform distribution of human users (sensors), which profoundly
affects our capabilities for sensing natural phenomena such as wildlife mortality, and to cope with
the uncertainties in the location and time stamps associated with related social media posts. The
main contribution of the paper is a robust methodology for deriving accurate spatiotemporal maps
of the target phenomenon in light of these two challenges. 

2 The Socioscope 

We propose Socioscope, a probabilistic model that robustly recovers spatiotemporal signals from 
social media data. Formally, consider f defined on discrete spatiotemporal bins. For example, a 
bin could be a U.S. state s on day t, or a county s in hour t. From the first stage we
obtain $x_{s,t}$, the count of target social media posts within that bin. The task is to estimate $f_{s,t}$
from $x_{s,t}$. A commonly-used estimate is $\hat{f}_{s,t} = x_{s,t}$ itself. This estimate can be justified as the
maximum likelihood estimate of a Poisson model $x \sim \mathrm{Poisson}(f)$. This idea underlies several
emerging systems such as earthquake damage monitoring from Twitter [12]. However, this estimate
is unsatisfactory since the counts $x_{s,t}$ can be noisy: as mentioned before, the estimate ignores
population bias - more target posts are generated when and where there are more social media 
users; the location of a target post is frequently inaccurate or missing, making it difficult to assign 
to the correct bin; and target posts can be quite sparse even though the total volume of social 
media is huge. Socioscope addresses these issues. 

For notational simplicity, we often denote our signal of interest by a vector $\mathbf{f} = (f_1, \ldots, f_n)^\top \in \mathbb{R}^n_{\ge 0}$,
where $f_j$ is a non-negative target phenomenon intensity in source bin $j = 1 \ldots n$. We will use
a wildlife example throughout the section. In this example, a source bin is a spatiotemporal unit
such as "California, day 1," and $f_j$ is the squirrel activity level in that unit. The mapping between
index $j$ and the aforementioned $(s,t)$ is one-to-one and will be clear from context.

2.1 Correcting Human Population Bias 

For now, assume each target post comes with precise location and time meta data. This allows us 
to count $x_j$, the number of target posts in bin $j$. Given $x_j$, it is tempting to use the maximum
likelihood estimate $\hat{f}_j = x_j$, which assumes a simple Poisson model $x_j \sim \mathrm{Poisson}(f_j)$. However, this
model is too naive: even if $f_j = f_k$, e.g., the level of squirrel activities is the same in two bins, we
would expect $x_j > x_k$ if there are more people in bin $j$ than in bin $k$, simply because more people
see the same group of squirrels.

To account for this population bias, we define an "active social media user population intensity" 
(loosely called "human population" below) $\mathbf{g} = (g_1, \ldots, g_n)^\top \in \mathbb{R}^n_{\ge 0}$. Let $z_j$ be the count of all social
media posts in bin $j$, the vast majority of which are not about the target phenomenon. We assume
$z_j \sim \mathrm{Poisson}(g_j)$. Since typically $z_j \gg 0$, the maximum likelihood estimate $\hat{g}_j = z_j$ is reasonable.

Importantly, we then posit the Poisson model 

$$x_j \sim \mathrm{Poisson}(\eta(f_j, g_j)). \qquad (1)$$

The intensity is defined by a link function $\eta(f_j, g_j)$. In this paper, we simply define $\eta(f_j, g_j) = f_j \cdot g_j$
but note that other more sophisticated link functions can be learned from data. Given $x_j$ and $z_j$,
one can then easily estimate $f_j$ with the plug-in estimator $\hat{f}_j = x_j / z_j$.
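To make the population correction concrete, the following minimal Python sketch computes the plug-in ratio per bin. The counts are hypothetical and the function name `plug_in_estimate` is ours, not part of the original work; bins with no overall posts are returned as `None` because the ratio is undefined there.

```python
def plug_in_estimate(target_counts, overall_counts):
    """Return x_j / z_j for every bin; None where z_j == 0 (ratio undefined)."""
    estimates = []
    for x_j, z_j in zip(target_counts, overall_counts):
        estimates.append(x_j / z_j if z_j > 0 else None)
    return estimates


# Hypothetical counts: two bins with the same underlying intensity
# but a tenfold difference in the number of active users.
x = [10, 100]        # target posts per bin
z = [1000, 10000]    # all posts per bin
print(plug_in_estimate(x, z))  # the raw counts differ, the ratios agree
```

The raw counts would suggest ten times more activity in the second bin, while the ratios are identical, which is the bias the link function is meant to remove.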



2.2 Handling Noisy and Incomplete Data 

This would have been the end of the story if we could reliably assign each post to a source bin. 
Unfortunately, this is often not the case for social media. In this paper, we focus on the problem 
of spatial uncertainty due to noisy or incomplete social media data. A prime example of spatial 
uncertainty is the lack of location meta data in posts from Twitter (called tweets; see footnote 1). In recent
data we collected, only 3% of tweets contain the latitude and longitude at which they were created. 
Another 47% contain a valid user self-declared location in his or her profile (e.g., "New York, NY"). 
However, such location does not automatically change while the user travels and thus may not be 
the true location at which a tweet is posted. The remaining 50% do not contain location at all. 
Clearly, we cannot reliably assign the latter two kinds of tweets to a spatiotemporal source bin (see footnote 2).
To address this issue, we borrow an idea from Positron Emission Tomography [28]. In particular, 
we define m detector bins which are conceptually distinct from the n source bins. The idea is that 
an event originating in some source bin goes through a transition process and ends up in one of the 
detector bins, where it is detected. This transition is modeled by an $m \times n$ matrix $P$ where

$$P_{ij} = \Pr(\text{detector } i \mid \text{source } j). \qquad (2)$$

$P$ is column stochastic: $\sum_{i=1}^m P_{ij} = 1, \forall j$. We defer the discussion of our specific $P$ to a case study,
but we mention that it is possible to reliably estimate $P$ directly from social media data (more on
this later). Recall the target post intensity at source bin $j$ is $\eta(f_j, g_j)$. We use the transition matrix
to define the target post intensity $h_i$ (note that $h_i$ can itself be viewed as a link function $\eta(\mathbf{f}, \mathbf{g})$)
at detector bin $i$:

$$h_i = \sum_{j=1}^n P_{ij} \, \eta(f_j, g_j). \qquad (3)$$



1 It may be possible to recover occasional location information from the tweet text itself instead of the meta data, 
but the problem still exists. 

2 Another kind of spatiotemporal uncertainty exists in social media even when the location and time meta data of
every post is known: social media users may not immediately post right at the spot where a target phenomenon
happens. Instead, there usually is an unknown time delay and spatial shift between the phenomenon and the post
generation. For example, one may not post a squirrel encounter on the road until she arrives at home later; the location
and time meta data only reflects tweet-generation at home. This type of spatiotemporal uncertainty can be addressed
by the same source-detector transition model. 
For the spatial uncertainty that we consider, we create three kinds of detector bins. For a
source bin j such as "California, day 1," the first kind collects target posts on day 1 whose latitude 
and longitude meta data is in California. The second kind collects target posts on day 1 without 
latitude and longitude meta data, but whose user self-declared profile location is in California. 
The third kind collects target posts on day 1 without any location information. Note the third 
kind of detector bin is shared by all other source bins for day 1, such as "Nevada, day 1," too. 
Consequently, if we had $n = 50T$ source bins corresponding to the 50 US states over $T$ days, there
would be $m = (2 \times 50 + 1)T$ detector bins.
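The bin bookkeeping above can be sketched in a few lines of Python. The ordering of detector bins within a period (GPS bins, then profile-location bins, then one shared no-location bin) is an assumption made for illustration only; the text does not prescribe a layout, and the function names are hypothetical.

```python
def bin_counts(num_states, num_periods):
    """Number of source bins n and detector bins m."""
    n = num_states * num_periods
    m = (2 * num_states + 1) * num_periods
    return n, m


def detector_index(kind, state, period, num_states):
    """Flat index of a detector bin (assumed layout, zero-based).

    kind 1: GPS location, kind 2: self-declared profile location,
    kind 3: no location (one bin per period, `state` is ignored).
    """
    base = period * (2 * num_states + 1)
    if kind == 1:
        return base + state
    if kind == 2:
        return base + num_states + state
    if kind == 3:
        return base + 2 * num_states
    raise ValueError("kind must be 1, 2, or 3")


print(bin_counts(50, 7))   # 50 states over one week
```

With 49 spatial units and 24 periods, `bin_counts` gives the sizes used in the later experiments.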

Critically, our observed target counts $\mathbf{x}$ are now with respect to the $m$ detector bins instead of
the $n$ source bins: $\mathbf{x} = (x_1, \ldots, x_m)^\top$. We will also denote the count sub-vector for the first kind of
detector bins by $\mathbf{x}^{(1)}$, the second kind $\mathbf{x}^{(2)}$, and the third kind $\mathbf{x}^{(3)}$. The same is true for the overall
counts $\mathbf{z}$. A trivial approach is to only utilize $\mathbf{x}^{(1)}$ and $\mathbf{z}^{(1)}$ to arrive at the plug-in estimator

$$\hat{f}_{s,t} = x^{(1)}_{s,t} / z^{(1)}_{s,t}. \qquad (4)$$

As we will show, we can obtain a better estimator by incorporating noisy data $\mathbf{x}^{(2)}$ and incomplete
data $\mathbf{x}^{(3)}$. $\mathbf{z}^{(1)}$ is sufficiently large and we will simply ignore $\mathbf{z}^{(2)}$ and $\mathbf{z}^{(3)}$.



2.3 Socioscope: Penalized Poisson Likelihood Model 

We observe target post counts $\mathbf{x} = (x_1, \ldots, x_m)$ in the detector bins. These are modeled as independently
Poisson distributed random variables:

$$x_i \sim \mathrm{Poisson}(h_i), \quad \text{for } i = 1 \ldots m. \qquad (5)$$

The log likelihood factors as

$$\ell(\mathbf{f}) = \log \prod_{i=1}^m \frac{h_i^{x_i} e^{-h_i}}{x_i!} = \sum_{i=1}^m (x_i \log h_i - h_i) + c, \qquad (6)$$

where $c$ is a constant. In (6) we treat $\mathbf{g}$ as given.

Target posts may be scarce in some detector bins. Indeed, we often have zero target posts for 
the wildlife case study to be discussed later. This problem can be mitigated by the fact that many 
real-world phenomena are spatiotemporally smooth, i.e., "neighboring" source bins in space or time
tend to have similar intensity. We therefore adopt a penalized likelihood approach by constructing a
graph-based regularizer. The undirected graph is constructed so that the nodes are the source bins.
Let $W$ be the $n \times n$ symmetric non-negative weight matrix. The edge weights are such that $w_{jk}$ is
large if $j$ and $k$ correspond to neighboring bins in space and time. Since $W$ is domain specific, we
defer its construction to the case study. 

Before discussing the regularizer, we need to perform a change of variables. Poisson intensity f is 
non-negative, necessitating a constrained optimization problem. It is more convenient to work with 
an unconstrained problem. To this end, we work with the exponential family natural parameters 
of Poisson. Specifically, let 

$$\theta_j = \log f_j, \quad \psi_j = \log g_j. \qquad (7)$$

Our specific link function becomes $\eta(\theta_j, \psi_j) = e^{\theta_j + \psi_j}$. The detector bin intensities become
$h_i = \sum_{j=1}^n P_{ij} e^{\theta_j + \psi_j}$.

Our graph-based regularizer applies to $\theta$ directly:

$$\Omega(\theta) = \frac{1}{2} \theta^\top L \theta, \qquad (8)$$

where $L$ is the combinatorial graph Laplacian [7]: $L = D - W$, and $D$ is the diagonal degree matrix
with $D_{jj} = \sum_{k=1}^n w_{jk}$.

Finally, Socioscope is the following penalized likelihood optimization problem:

$$\min_{\theta} \; -\sum_{i=1}^m (x_i \log h_i - h_i) + \lambda \Omega(\theta), \qquad (9)$$

where $\lambda$ is a positive regularization weight.

2.4 Optimization 

We solve the Socioscope optimization problem (9) with BFGS, a quasi-Newton method [20]. The
gradient can be easily computed as

$$\nabla = \lambda L \theta - H P^\top (\mathbf{r} - \mathbf{1}), \qquad (10)$$

where $\mathbf{r} = (r_1, \ldots, r_m)$ is a ratio vector with $r_i = x_i / h_i$, and $H$ is a diagonal matrix with $H_{jj} = \eta(\theta_j, \psi_j)$.

We initialize $\theta$ with the following heuristic. Given counts $\mathbf{x}$ and the transition matrix $P$, we
compute the least-squares projection $\eta_0$ that minimizes $\|\mathbf{x} - P\eta_0\|_2$. This projection is easy to compute. However,
$\eta_0$ may contain negative components not suitable for Poisson intensity. We force positivity by
setting $\eta_0 \leftarrow \max(10^{-4}, \eta_0)$ element-wise, where the floor $10^{-4}$ ensures that $\log \eta_0 > -\infty$. From the
definition $\eta(\theta, \psi) = \exp(\theta + \psi)$, we then obtain the initial parameter

$$\theta_0 = \log \eta_0 - \psi. \qquad (11)$$

Our optimization is efficient: problems with more than one thousand variables ($n$) are solved in
about 15 seconds with fminunc() in Matlab.
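The objective (9), its gradient (10), and the initialization heuristic can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' Matlab code: the function names and the toy numbers are hypothetical, and the sketch assumes every detector bin has strictly positive intensity so that the logarithm is finite.

```python
import numpy as np


def objective_and_gradient(theta, x, P, psi, L, lam):
    """Return the penalized negative log-likelihood (9) and its gradient (10)."""
    eta = np.exp(theta + psi)            # source intensities; also the diagonal of H
    h = P @ eta                          # detector intensities h_i
    value = -np.sum(x * np.log(h) - h) + 0.5 * lam * (theta @ L @ theta)
    r = x / h                            # ratio vector r_i = x_i / h_i
    grad = lam * (L @ theta) - eta * (P.T @ (r - 1.0))
    return value, grad


def initial_theta(x, P, psi, floor=1e-4):
    """Least-squares start, floored so that the logarithm stays finite."""
    eta0, *_ = np.linalg.lstsq(P, x, rcond=None)
    return np.log(np.maximum(floor, eta0)) - psi


# Toy problem (all numbers hypothetical): 3 source bins on a chain, 4 detector bins.
rng = np.random.default_rng(0)
P = rng.random((4, 3))
P /= P.sum(axis=0)                       # make P column stochastic
W = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W           # combinatorial Laplacian L = D - W
x = np.array([1.0, 0.0, 2.0, 3.0])
psi = np.log(np.array([5.0, 2.0, 1.0]))
theta0 = initial_theta(x, P, psi)
value, grad = objective_and_gradient(theta0, x, P, psi, L, lam=0.1)
print(value, grad)
```

Because the function returns the value and gradient together, it could be handed to a generic quasi-Newton routine; for instance, `scipy.optimize.minimize` accepts such a pair when called with `method="BFGS"` and `jac=True`.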

2.5 Parameter Tuning 

The choice of the regularization parameter $\lambda$ has a profound effect on the smoothness of the estimates.
It may be possible to select such parameters based on prior knowledge in certain problems,
but for our experiments we select them using a cross-validation (CV) procedure, which
gives us a fully data-based and objective approach to regularization.

CV is quite simple to implement in the Poisson setting. A hold-out set of data can be constructed 
by simply sub-sampling events from the total observation uniformly at random. This produces a 
partial data set of a subset of the counts that follows precisely the same distribution as the whole 
set, modulo a decrease in the total intensity per the level of subsampling. The complement of the 
hold-out set is what remains of the full dataset, and we will call this the training set. The hold-out 
set is taken to be a specific fraction of the total. For theoretical reasons beyond the scope of this 
paper, we do not recommend leave-one-out CV [27| 18]. 

CV is implemented by generating a number of random splits of this type (we can generate as
many as we wish), and for each split we run the optimization algorithm above on the training set for
a range of values of $\lambda$. We then compute the (unregularized) value of the log-likelihood on the hold-out
set. This provides us with an estimate of the log-likelihood for each setting of $\lambda$. We simply select
the setting that maximizes the estimated log-likelihood.
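The splitting step described above can be sketched in plain Python. Each observed event is assigned to the hold-out set independently with a fixed probability, which is one way to sub-sample events uniformly at random. The rescaling of the training-set intensity by `p / (1 - p)` before scoring the hold-out counts is our assumption about how the "decrease in total intensity" would be accounted for; the function names are hypothetical.

```python
import math
import random


def split_counts(counts, holdout_fraction, rng):
    """Assign each event to the hold-out set with probability `holdout_fraction`."""
    train, holdout = [], []
    for c in counts:
        held = sum(1 for _ in range(c) if rng.random() < holdout_fraction)
        holdout.append(held)
        train.append(c - held)
    return train, holdout


def holdout_log_likelihood(holdout, train_intensity, holdout_fraction):
    """Unregularized Poisson log-likelihood of hold-out counts (up to a constant).

    `train_intensity` holds detector intensities fitted on the training set;
    they are rescaled to the hold-out sampling level (an assumption).
    """
    scale = holdout_fraction / (1.0 - holdout_fraction)
    total = 0.0
    for x_i, h_i in zip(holdout, train_intensity):
        mu = scale * h_i
        total += x_i * math.log(mu) - mu
    return total


rng = random.Random(0)
train, holdout = split_counts([5, 0, 12], 0.2, rng)
print(train, holdout)
```

In a full procedure one would repeat the split several times, fit on `train` for each candidate regularization weight, and average the hold-out scores.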

2.6 Theoretical Considerations 

The natural measure of signal-to-noise in this problem is the number of counts in each bin. The 
higher the counts, the more stable and "less noisy" our estimators will be. Indeed, if we directly 
observe $x_i \sim \mathrm{Poisson}(h_i)$, then the normalized error $E[((x_i - h_i)/h_i)^2] = h_i^{-1} \approx x_i^{-1}$. So larger counts,
due to larger underlying intensities, lead to small errors on a relative scale. However, the accuracy
of our recovery also depends on the regularity of the underlying function $f$. If it is very smooth, for
example a constant function, then the error would be inversely proportional to the total number
of counts, not the number in each individual bin. This is because in the extreme smooth case, $f$ is
determined by a single constant.

To give some insight into dependence of the estimate on the total number of counts, suppose
that $f$ is the underlying continuous intensity function of interest. Furthermore, let $f$ be a Hölder $\alpha$-smooth
function. The parameter $\alpha$ is related to the number of continuous derivatives $f$ has. Larger
values of $\alpha$ correspond to smoother functions. Such a model is reasonable for the application at
hand, as discussed in our motivation for regularization above. We recall the following minimax
lower bound, which follows from the results in [11, 31].

Theorem 1. Let $f$ be a Hölder $\alpha$-smooth $d$-dimensional intensity function and suppose we observe
$N$ events from the distribution Poisson($f$). Then there exists a constant $C_\alpha > 0$ such that

$$\inf_{\hat{f}} \sup_{f} \frac{E[\|\hat{f} - f\|_1^2]}{\|f\|_1^2} \ge C_\alpha N^{\frac{-2\alpha}{2\alpha + d}},$$

where the infimum is over all possible estimators. The error is measured with the 1-norm, rather
than the 2-norm, which is a more appropriate and natural norm in density and intensity estimation.
The theorem tells us that no estimator can achieve a faster rate of error decay than the bound 
above. There exist many types of estimators that nearly achieve this bound (e.g., to within a log 
factor), and with more work it is possible to show that our regularized estimators, with adaptively 
chosen bin sizes and appropriate regularization parameter settings, could also nearly achieve this 
rate. For the purposes of this discussion, the lower bound, which certainly applies to our situation, 
will suffice. 

For example, consider just two spatial dimensions ($d = 2$) and $\alpha = 1$, which corresponds to
Lipschitz smooth functions, a very mild regularity assumption. Then the bound says that the
error is proportional to $N^{-1/2}$. This gives useful insight into the minimal data requirements of our
methods. It tells us, for example, that if we want to reduce the error of the estimator by a factor
of say 2, then the total number of counts must be increased by a factor of 4. If the smoothness $\alpha$ is
very large, then doubling the counts can halve the error. The message is simple. More events and
higher counts will provide more accurate estimates. 
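The arithmetic behind these statements follows directly from the exponent in the bound and can be checked with a short Python sketch (the function names are ours, introduced only for illustration):

```python
def rate_exponent(alpha, d):
    """Exponent of N in the minimax lower bound: -2*alpha / (2*alpha + d)."""
    return -2.0 * alpha / (2.0 * alpha + d)


def count_multiplier(error_reduction, alpha, d):
    """Factor by which N must grow for the bound to shrink by `error_reduction`."""
    return error_reduction ** (-1.0 / rate_exponent(alpha, d))


print(rate_exponent(1, 2))            # Lipschitz case in two dimensions
print(count_multiplier(2.0, 1, 2))    # counts needed to halve the bound
print(count_multiplier(2.0, 1e6, 2))  # very smooth case
```

For $d = 2$ and $\alpha = 1$ the exponent is $-1/2$, so halving the bound requires four times as many counts; as $\alpha$ grows the exponent approaches $-1$ and the multiplier approaches 2.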

3 Related Work 

To our knowledge, there is no comparable prior work that focuses on robust signal recovery from
social media (i.e., the "second stage" as we mentioned in the introduction). However, there has
been considerable related work on the first stage, which we summarize below.

Topic detection and tracking (TDT) aims at identifying emerging topics from text streams and
grouping documents based on their topics. The early work in this direction began with news text
streamed from newswire and transcribed from other media [1]. Recent research focused on user-generated
content on the web and on the spatio-temporal variation of topics. Latent Dirichlet
Allocation (LDA) [4] is a popular unsupervised method to detect topics. Mei et al [18] extended
LDA by taking spatio-temporal context into account to identify subtopics from weblogs. They
analyzed the spatio-temporal pattern of topic $\theta$ by $p(\text{time} \mid \theta, \text{location})$ and $p(\text{location} \mid \theta, \text{time})$, and
showed that documents created from the same spatio-temporal context tend to share topics. In the
same spirit, Yin et al [33] studied GPS-associated documents, whose coordinates are generated by
a Gaussian mixture model in their generative framework. Cataldi et al [5] proposed a feature-pivot
method. They first identified keywords whose occurrences dramatically increase in a specified time
interval and then connected the keywords to detect emerging topics. Besides text, social network
structure also provides important information for detecting community-based topics [24] and user
interests [17].

Event detection is highly related to TDT. Yang et al [32] used clustering algorithms to identify
events from news streams. Others tried to distinguish posts related to real-world events from non-event
posts, such as those describing daily life or emotions [3]. Such events were also detected
in Flickr photos with meta information [6] and Twitter [30]. Yet others were interested in events
with special characteristics. Popescu et al [22, 23] focused on the detection of controversial events
which provoke a public debate in which audience members express opposing opinions. Watanabe
et al [29] studied smaller-scale local events, such as sales at a supermarket. Sakaki et al [25]
monitored Twitter to detect real-time events such as earthquakes and hurricanes.

Another line of related work uses social media as a data source to answer scientific questions [16].
Most previous work studied questions in linguistics, sociology and human interactions. For example,
Eisenstein et al [13] studied geographic linguistic variation with geotagged social media.
Danescu-Niculescu-Mizil et al [10] studied the psycholinguistic theory of communication accommodation
with Twitter conversations. Gupte et al [15] studied social hierarchy and stratification
in online social networks. Crandall et al [9] and Anagnostopoulos et al [2] tried to understand
social influence through interactions on social networks.

As stated earlier, Socioscope differs from this related work in its focus on robust signal recovery
on predefined target phenomena. The target posts may be generated at a very low, though
sustained, rate, and are subject to noise corruption. The above approaches are unlikely to estimate
the underlying intensity accurately.

4 A Synthetic Experiment 

We start with a synthetic experiment whose known ground-truth intensity $\mathbf{f}$ allows us to quantitatively
evaluate the effectiveness of Socioscope. The synthetic experiment matches the case study
in the next section. There are 48 US continental states plus Washington DC, and $T = 24$ hours.
This leads to a total of $n = 1176$ source bins, and $m = (2 \times 49 + 1)T = 2376$ detector bins. The
transition matrix $P$ is the same as in the case study, to be discussed later. The overall counts $\mathbf{z}$ are
obtained from actual Twitter data and $\mathbf{g} = \mathbf{z}^{(1)}$.

We design the ground-truth target signal $\mathbf{f}$ to be temporally constant but spatially varying.
Figure 1(a) shows the ground-truth $\mathbf{f}$ spatially. It is a mixture of two Gaussian distributions discretized
at the state level. The modes are in Washington and New York, respectively. From $P$, $\mathbf{f}$
and $\mathbf{g}$, we generate the observed target post counts for each detector bin by a Poisson random
number generator: $x_i \sim \mathrm{Poisson}(\sum_{j=1}^n P_{ij} f_j g_j)$, $i = 1 \ldots m$. The sum of counts in $\mathbf{x}^{(1)}$ is 56, in $\mathbf{x}^{(2)}$
1106, and in $\mathbf{x}^{(3)}$ 1030. Considering the number of bins we have, the data is very sparse.

Estimator                                                                  Relative error
(i)   scaled $\mathbf{x}^{(1)}$                                            14.11
(ii)  scaled $\mathbf{x}^{(1)} / \mathbf{z}^{(1)}$                         46.73
(iii) Socioscope with $\mathbf{x}^{(1)}$                                   0.17
(iv)  Socioscope with $\mathbf{x}^{(1)} + \mathbf{x}^{(2)}$                1.83
(v)   Socioscope with $\mathbf{x}^{(1)}$, $\mathbf{x}^{(2)}$               0.12

Table 1: Relative error of different estimators

Figure 1: The synthetic experiment. (a) ground-truth $\mathbf{f}$; (b) scaled $\mathbf{x}^{(1)}$; (c) Socioscope.

Given $\mathbf{x}$, $P$, $\mathbf{g}$, we compare the relative error $\|\hat{\mathbf{f}} - \mathbf{f}\|_2 / \|\mathbf{f}\|_2$ of several estimators in Table 1:

(i) $\hat{\mathbf{f}}$ is $\mathbf{x}^{(1)}$ scaled by a constant involving $\epsilon_1$ and $\mathbf{z}^{(1)}$, where $\epsilon_1$ is the fraction of tweets with precise location stamp (discussed
later in case study). Scaling matches it to the other estimators. Figure 1(b) shows this simple
estimator, aggregated spatially. It is a poor estimator: besides being non-smooth, it contains 32
"holes" (states with zero intensity, colored in blue) due to data scarcity.

(ii) $\hat{\mathbf{f}} = \mathbf{x}^{(1)} / (\epsilon_1 \mathbf{z}^{(1)})$, which
naively corrects the population bias as discussed in (4). It is even worse than the simple estimator,
because naive bin-wise correction magnifies the variance in sparse $\mathbf{x}^{(1)}$.

(iii) Socioscope with $\mathbf{x}^{(1)}$ only. This simulates the practice of discarding noisy or incomplete
data, but regularizing for smoothness. The relative error was reduced dramatically.

(iv) Same as (iii) but replace the values of $\mathbf{x}^{(1)}$ with $\mathbf{x}^{(1)} + \mathbf{x}^{(2)}$. This simulates the practice of
ignoring the noise in $\mathbf{x}^{(2)}$ and pretending it is precise. The result is worse than (iii), indicating that
simply including noisy data may hurt the estimation.

(v) Socioscope with $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ separately, where $\mathbf{x}^{(2)}$ is treated as noisy by $P$. It reduces the
relative error further, and demonstrates the benefits of treating noisy data specially.

(vi) Socioscope with the full $\mathbf{x}$. It achieves the lowest relative error among all methods, and is
the closest to the ground truth (Figure 1(c)). Compared to (v), this demonstrates that even counts
$\mathbf{x}^{(3)}$ without location can also help us to recover $\mathbf{f}$ better.
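The data generation and the error metric used in this experiment can be sketched with NumPy as follows. The tiny matrices below are hypothetical stand-ins; the actual experiment uses the 2376 x 1176 transition matrix of the case study, and the function names are ours.

```python
import numpy as np


def simulate_detector_counts(P, f, g, rng):
    """Draw x_i ~ Poisson(sum_j P_ij * f_j * g_j) for every detector bin."""
    h = P @ (f * g)
    return rng.poisson(h)


def relative_error(f_hat, f):
    """Relative error ||f_hat - f||_2 / ||f||_2."""
    return np.linalg.norm(f_hat - f) / np.linalg.norm(f)


# Hypothetical toy sizes: 2 source bins, 3 detector bins.
rng = np.random.default_rng(0)
P = np.array([[0.5, 0.1], [0.3, 0.4], [0.2, 0.5]])   # columns sum to 1
f = np.array([0.02, 0.01])                            # target intensity
g = np.array([1000.0, 4000.0])                        # human population intensity
x = simulate_detector_counts(P, f, g, rng)
print(x, relative_error(2.0 * f, f))
```

An estimator that doubles every entry of the true signal has relative error 1 under this metric, which gives a sense of scale for the entries in Table 1.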



5 Case Study: Roadkill 

We now turn to a real-world task of estimating the spatio-temporal intensity of roadkill for several
common wildlife species from Twitter posts. The study of roadkill has value in ecology, conservation,
and transportation safety.

Figure 2: Human population intensity $\mathbf{g}$.

The target phenomenon consists of roadkill events for a specific species within the continental
United States during September 22-November 30, 2011. Our spatio-temporal source bins are
state × hour-of-day. Let $s$ index the 48 continental US states plus District of Columbia. We aggregate
the 10-week study period into 24 hours of a day. The target counts $\mathbf{x}$ are still sparse even with aggregation:
for example, most state-hour combinations have zero counts for armadillo and the largest
count in $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ is 3. Therefore, recovering the underlying signal $\mathbf{f}$ remains a challenge. Let $t$
index the hours from 1 to 24. This results in $|s| = 49$, $|t| = 24$, $n = |s||t| = 1176$, $m = (2|s| + 1)|t| = 2376$.
We will often index source or detector bins by the subscript $(s, t)$, in addition to $i$ or $j$, below.
The translation should be obvious.

5.1 Data Preparation 

We chose Twitter as our data source because public tweets can be easily collected through its APIs. 
All tweets include time meta data. However, most tweets do not contain location meta data, as
discussed earlier. 

5.1.1 Overall Counts $\mathbf{z}^{(1)}$ and Human Population Intensity $\mathbf{g}$.

To obtain the overall counts z, we collected tweets through the Twitter stream API using bounding 
boxes covering continental US. The API supplied a subsample of all tweets (not just target posts) 
with geo-tag. Therefore, all these tweets include precise latitude and longitude on where they 
were created. Through a reverse geocoding database (http://www.datasciencetoolkit.org), we 
mapped the coordinates to a US state. There are a large number of such tweets. Counting the
number of tweets in each state-hour bin gave us $\mathbf{z}^{(1)}$, from which $\mathbf{g}$ is estimated.

Figure 2 shows the estimated $\mathbf{g}$. The x-axis is hour of day and y-axis is the states, ordered by
longitude from east (top) to west (bottom). Although $\mathbf{g}$ in this matrix form contains full information,
it can be hard to interpret. Therefore, we visualize aggregated results as well: First, we aggregate
out time in $\mathbf{g}$: for each state $s$, we compute $\sum_{t=1}^{24} g_{s,t}$ and show the resulting intensity maps in
Figure 2(b). Second, we aggregate out state in $\mathbf{g}$: for each hour of day $t$, we compute $\sum_{s=1}^{49} g_{s,t}$ and
show the daily curve in Figure 2(c). From these two plots, we clearly see that human population
intensity varies greatly both spatially and temporally. 
5.1.2 Identifying Target Posts to Obtain Counts x.

To produce the target counts x, we need to first identify target posts describing roadkill events. 
Although not part of Socioscope, we detail this preprocessing step here for reproducibility.

In step 1, we collected tweets using a keyword API. Each tweet must contain the wildlife name 
(e.g., "squirrel(s)") and the phrase "ran over". We obtained 5857 squirrel tweets, 325 chipmunk 
tweets, 180 opossum tweets and 159 armadillo tweets during the study period. However, many such 
tweets did not actually describe roadkill events. For example, "I almost ran over an armadillo on 
my longboard, luckily my cat-like reflexes saved me. " Clearly, the author did not kill the armadillo. 

In step 2, we built a binary text classifier to identify target posts among them. Following [26], the 
tweets were case-folded without any stemming or stopword removal. Any user mentions preceded 
by a "@" were replaced by the anonymized user name "@USERNAME". Any URLs starting with
"http" were replaced by the token "HTTPLINK". Hashtags (compound words following "#") were
not split and were treated as a single token. Emoticons, such as ":)" or ":D", were also included as
tokens. Each tweet is then represented by a feature vector consisting of unigram and bigram counts.
If any unigram or bigram included animal names, we added an additional feature by replacing the
animal name with the generic token "ANIMAL". For example, we would create an extra feature
"over ANIMAL" for the bigram "over raccoon". The training data consists of 1,450 manually labeled
tweets in August 2011 (i.e., outside our study period). These training tweets contain hundreds of 
animal species, not just the target species. The binary label is whether the tweet is a true first- 
hand roadkill experience. We trained a linear Support Vector Machine (SVM). The CV accuracy is 
nearly 90%. We then applied this SVM to classify tweets surviving step 1. Those tweets receiving 
a positive label were treated as target posts. 
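The feature extraction in step 2 can be sketched in plain Python. This is an illustrative approximation rather than the original pipeline: whitespace tokenization, the regular expressions, and the small `ANIMALS` set are all assumptions introduced here, and the classifier itself is not shown.

```python
import re
from collections import Counter

# Hypothetical animal vocabulary; the original list is not given in the text.
ANIMALS = {"squirrel", "squirrels", "raccoon", "armadillo", "opossum", "chipmunk"}


def tokenize(tweet):
    """Case-fold, anonymize user mentions and URLs, then split on whitespace."""
    text = tweet.lower()
    text = re.sub(r"@\w+", "@USERNAME", text)
    text = re.sub(r"http\S+", "HTTPLINK", text)
    return text.split()   # hashtags and emoticons stay single tokens


def features(tweet):
    """Unigram and bigram counts, plus generic ANIMAL variants."""
    tokens = tokenize(tweet)
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(grams)
    for gram in grams:
        words = gram.split(" ")
        if any(w in ANIMALS for w in words):
            generic = " ".join("ANIMAL" if w in ANIMALS else w for w in words)
            counts[generic] += 1
    return counts


feats = features("Just ran over raccoon @bob http://t.co/x :(")
print(feats["over raccoon"], feats["over ANIMAL"])
```

The resulting sparse count vectors could then be fed to any linear classifier; the original features remain alongside the generic ones, so the model can use both species-specific and species-independent evidence.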

In step 3, we produce the $\mathbf{x}^{(1)}$, $\mathbf{x}^{(2)}$, $\mathbf{x}^{(3)}$ counts. Because these target tweets were collected by the
keyword API, the nature of the Twitter API means that most do not contain precise location
information. As mentioned earlier, only 3% of them contain coordinates. We processed this 3%
by the same reverse geocoding database to map them to a US state $s$, and place them in the $x^{(1)}_{s,t}$
detection bins. 47% of the target posts do not contain coordinates but can be mapped to a US state
from user self-declared profile location. These are placed in the $x^{(2)}_{s,t}$ detection bins. The remaining
50% contained no location meta data, and were placed in the $x^{(3)}_{t}$ detection bins (see footnote 3).

5.1.3 Constructing the Transition Matrix P. 

In this study, $P$ characterizes the fraction of tweets actually generated in source bin
$(s,t)$ that end up in each of the three detector bins: precise location $(s,t)^{(1)}$, potentially noisy location $(s,t)^{(2)}$, and
missing location $t^{(3)}$. We define $P$ as follows:

$P_{(s,t)^{(1)},(s,t)} = 0.03$, and $P_{(r,t)^{(1)},(s,t)} = 0$ for $\forall r \neq s$, to reflect the fact that we know precisely
3% of the target posts' location.

$P_{(r,t)^{(2)},(s,t)} = 0.47 M_{r,s}$ for all $r, s$. $M$ is a $49 \times 49$ "mis-self-declare" matrix. $M_{r,s}$ is the probability
that a user self-declares in her profile that she is in state $r$, but her post is in fact generated in state $s$.
We estimated $M$ from a separate large set of tweets with both coordinates and self-declared profile
locations. The $M$ matrix is asymmetric and interesting in its own right: many posts self-declared
in California or New York were actually produced all over the country; many self-declared in
Washington DC were actually produced in Maryland or Virginia; more posts self-declare Wisconsin
but were actually in Illinois than the other way around.

$P_{t^{(3)},(s,t)} = 0.50$. This aggregates tweets with missing information into the third kind of detector
bins.

3 There were actually only a fraction of all tweets without location which came from all over the world. We estimated
this US/World fraction using z.
5.1.4 Specifying the Graph Regularizer. 

Our graph has two kinds of edges. Temporal edges connect source bins with the same state and
adjacent hours by weight $w_t$. Spatial edges connect source bins with the same hour and adjacent
states by weight $w_s$. The regularization weight $\lambda$ was absorbed into $w_t$ and $w_s$. We
tuned the weights $w_t$ and $w_s$ with cross-validation (CV) on the 2D grid
$\{10^{-3}, 10^{-2.5}, \ldots, 10^{3}\}^2$.
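
As a rough illustration of this graph, the sketch below builds a weighted graph Laplacian over the
source bins and enumerates the CV grid. Several things here are assumptions rather than details
given in this section: that the regularizer takes the common graph-Laplacian form, the function
name and bin ordering, the toy state adjacency passed in by the caller, and the optional
`cyclic_hours` flag (the text does not say whether the last and first hours are connected).

```python
import itertools
import numpy as np

def graph_laplacian(state_adj, n_hours, w_t, w_s, cyclic_hours=False):
    """Laplacian of the source-bin graph; bin (s, t) has index s * n_hours + t (assumed)."""
    n_states = state_adj.shape[0]
    n_bins = n_states * n_hours
    W = np.zeros((n_bins, n_bins))

    def idx(s, t):
        return s * n_hours + t

    # Temporal edges: same state, adjacent hours, weight w_t.
    last = n_hours if cyclic_hours else n_hours - 1
    for s in range(n_states):
        for t in range(last):
            a, b = idx(s, t), idx(s, (t + 1) % n_hours)
            W[a, b] = W[b, a] = w_t
    # Spatial edges: same hour, adjacent states, weight w_s.
    for r, s in zip(*np.nonzero(np.triu(state_adj, k=1))):
        for t in range(n_hours):
            a, b = idx(r, t), idx(s, t)
            W[a, b] = W[b, a] = w_s
    return np.diag(W.sum(axis=1)) - W

# The 2D grid {10^-3, 10^-2.5, ..., 10^3}^2: 13 values per weight, 169 (w_t, w_s) pairs.
grid = np.logspace(-3, 3, 13)
candidates = list(itertools.product(grid, grid))
```

For a vector f over the source bins, the quadratic form f @ L @ f then equals the weighted sum of
squared differences across all temporal and spatial edges, which is what penalizes rough estimates.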

5.2 Results 

We present results on four animals: armadillos, chipmunks, squirrels, and opossums. Perhaps
surprisingly, precise roadkill intensities for these animals are apparently unknown to science (this
serves as a good example of the value Socioscope may provide to wildlife scientists). Instead, domain
experts were only able to provide a range map of each animal; see the left column in Figure 3.
These maps indicate presence/absence only, and were extracted from NatureServe [21]. In addition,
the experts defined armadillo and opossum as nocturnal, chipmunk as diurnal, and squirrels as
both crepuscular (active primarily during twilight) and diurnal. Due to the lack of quantitative
ground truth, our comparison will necessarily be qualitative in nature.

Socioscope provides sensible estimates on these animals. For example, Figure 4(a) shows the counts
$x^{(1)} + x^{(2)}$ for chipmunks, which are very sparse (the largest count in any bin is 3), and
Figure 4(b) shows the Socioscope estimate $\hat{f}$. The axes are the same as in Figure 2(a). In
addition, we present the state-by-state intensity maps in the middle column of Figure 3 by
aggregating $\hat{f}$ spatially. The Socioscope results match the range maps well for all animals.
The right column in Figure 3 shows the daily animal activities by aggregating $\hat{f}$ temporally.
These curves match the animals' diurnal patterns well, too.
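
The two aggregations used for these plots can be sketched as follows. This is only an
illustration: the array layout (states by hours) and the function name are assumptions, and the
text does not say whether the aggregation sums or averages, so summation is assumed here.

```python
import numpy as np

def aggregate_estimate(f_hat):
    """Collapse an estimate of shape (n_states, n_hours) into the two plotted summaries.

    Assumption: aggregation is a plain sum; the text does not specify sum versus mean.
    """
    per_state = f_hat.sum(axis=1)  # one value per state: the state-by-state intensity map
    per_hour = f_hat.sum(axis=0)   # one value per hour: the daily activity curve
    return per_state, per_hour
```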

The Socioscope estimates are superior to the baseline methods in Table 1. Due to space limits we
only present two examples on chipmunks, but similar observations hold for all animals.
The baseline estimator of simply scaling $x^{(1)} + x^{(2)}$ produced the temporal and spatial
aggregates in Figure 5(a,b). Compared to Figure 3(b, right), the temporal curve has a spurious peak
around 4-5pm. The spatial map contains spurious intensity in California and Texas, states outside the
chipmunk range as shown in Figure 3(b, left). Both are produced by population bias when and
where there were strong background social media activities (see Figure 2(b,c)). In addition, the
spatial map contains 27 "holes" (states with zero intensity, colored in blue) due to data scarcity. In
contrast, Socioscope's estimates in Figure 3 avoid this problem by regularization. Another baseline
estimator, $(x^{(1)} + x^{(2)})/z^{(1)}$, is shown in Figure 5(c). Although corrected for population
bias, this estimator lacks the transition model and regularization. It does not address data scarcity
either.
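
To make the two baselines concrete, here is a minimal sketch. The function names, the `scale`
parameter, and the handling of bins with no background posts (set to zero) are assumptions for
illustration; `z1` is assumed to hold background post counts in the same bins as the target counts.

```python
import numpy as np

def baseline_scaled(x1, x2, scale=1.0):
    """Baseline 1: scale the raw target counts x^(1) + x^(2); no bias correction."""
    return scale * (x1 + x2)

def baseline_population_normalized(x1, x2, z1):
    """Baseline 2: (x^(1) + x^(2)) / z^(1), elementwise over bins.

    Assumption: bins with z1 == 0 are set to 0 rather than left undefined.
    """
    counts = (x1 + x2).astype(float)
    out = np.zeros_like(counts)
    np.divide(counts, z1, out=out, where=z1 > 0)
    return out
```

Neither function borrows strength from neighboring bins, so any bin with zero target counts stays
at zero in both baselines, which is the "hole" behavior described above.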

6 Future Work 

Using social media as a data source for spatio-temporal signal recovery is an emerging area.
Socioscope represents a first step toward this goal. There are many open questions:

[Figure 3 panels: (b) chipmunk (Tamias striatus); (c) squirrel (Sciurus carolinensis and several
others); (d) opossum (Didelphis virginiana)]

Figure 3: Socioscope estimates match animal habits well. (Left) range map from NatureServe,
(Middle) Socioscope $\hat{f}$ aggregated spatially, (Right) $\hat{f}$ aggregated temporally.

1. We treated target posts as certain. In reality, a natural language processing system can often 
supply a confidence. For example, a tweet might be deemed to be a target post only with probability 
0.8. It will be interesting to study ways to incorporate such confidence into our framework. 

2. The temporal delay and spatial displacement between the target event and the generation of
a post are commonplace, as discussed in footnote 2. Estimating an appropriate transition matrix $P$
from social media data so that Socioscope can handle such "point spread functions" remains future
work.

[Figure 4 panels: (a) $x^{(1)} + x^{(2)}$; (b) Socioscope $\hat{f}$]

Figure 4: Raw counts and Socioscope $\hat{f}$ for chipmunks.

[Figure 5 panels: (a) $x^{(1)} + x^{(2)}$, axis label: Local Time; (b) $x^{(1)} + x^{(2)}$;
(c) $(x^{(1)} + x^{(2)})/z^{(1)}$]

Figure 5: Examples of inferior baseline estimators. In all plots, states with zero counts are colored
in blue.

3. It might be necessary to include psychological factors to better model the human "sensors."
For instance, a person may not bother to tweet about a chipmunk roadkill, but may be eager to do
so upon seeing a moose roadkill.

4. Instead of discretizing space and time into bins, one may adopt a spatial point process model
to learn a continuous intensity function [19].

Addressing these considerations will further improve Socioscope. 